ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation

Authors: Swapnil Gandhi, Mark Zhao, Athinagoras Skiadopoulos, Christos Kozyrakis (Stanford)

Link: [2405.14009] ReCycle: Resilient Training of Large DNNs using Pipeline Adaptation

Abstract: Training large Deep Neural Network (DNN) models requires thousands of GPUs over the course of several days or weeks. At this scale, failures are frequent and can have a big impact on training throughput [failures are very frequent in large-scale GPU clusters]. Utilizing spare GPU servers to mitigate performance loss becomes increasingly costly as model sizes grow [keeping spare GPU servers on hand to absorb failures is very expensive at this scale]. ReCycle is a system designed for efficient DNN training in the presence of failures, without relying on spare servers [ReCycle: a fault-tolerant DNN training system that does not use spare servers]. It exploits the inherent functional redundancy in distributed training systems -- where servers across data-parallel groups store the same model parameters [exploits the fact that DP peers store identical model parameters] -- and pipeline schedule bubbles within each data-parallel group [exploits PP bubbles]. When servers fail, ReCycle dynamically re-routes micro-batches to data-parallel peers, allowing for uninterrupted training despite multiple failures [when failures occur, ReCycle re-routes the affected micro-batches]. However, this re-routing can create imbalances across pipeline stages, leading to reduced training throughput [but re-routing unbalances the pipeline]. To address this, ReCycle introduces two key optimizations that ensure re-routed micro-batches are processed within the original pipeline schedule's bubbles. First, it decouples the backward pass into two phases: one for computing gradients for the input and another for calculating gradients for the parameters [decouples the backward pass]. Second, it avoids synchronization across pipeline stages by staggering the optimizer step [avoids optimizer synchronization across stages]. Together, these optimizations enable adaptive pipeline schedules that minimize or even eliminate training throughput degradation during failures [this finer-grained scheduling exploits the bubbles and minimizes the throughput loss caused by failures]. We describe a prototype for ReCycle and show that it achieves high training throughput under multiple failures, outperforming recent proposals for fault-tolerant training such as Oobleck and Bamboo by up to 1.46× and 1.64×, respectively.

Summary

Introduction

1. Parallel training

[figure: image-20250209200233556]

2. Failures

  • Failures are very common during training and waste a large amount of resources:

    • large-scale training clusters at Microsoft see a failure every ≈ 45 minutes [30].
    • Meta encountered over 100 hardware failures while training OPT-175B, resulting in the loss of 178,000 GPU-hours.
  • Goal: strengthen the fault tolerance of distributed training.

Motivation

  1. The functional redundancy of data parallelism (DP) can be exploited: workers at the same pipeline stage in different DP groups hold identical model parameters (see the sketch below).

[figure: image-20250209203020642]
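    A minimal sketch of this idea (the `reroute` helper is hypothetical, not the paper's API): micro-batches that were assigned to a failed worker are redistributed round-robin over the surviving workers that hold the same pipeline stage in other data-parallel groups.

    ```python
    # Hypothetical illustration of ReCycle's core rerouting idea: a failed worker's
    # micro-batches go to data-parallel peers that store the same pipeline stage.

    def reroute(microbatches, dp_degree, stage, failed):
        """Assign each orphaned micro-batch to a live DP peer at the same stage."""
        peers = [d for d in range(dp_degree) if (d, stage) not in failed]
        assert peers, "no functional redundancy left for this stage"
        return {mb: (peers[i % len(peers)], stage) for i, mb in enumerate(microbatches)}

    # Example: 3-way DP, the worker at (dp=0, stage=1) has failed.
    print(reroute(microbatches=[0, 1, 2, 3], dp_degree=3, stage=1, failed={(0, 1)}))
    # -> {0: (1, 1), 1: (2, 1), 2: (1, 1), 3: (2, 1)}
    ```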

  2. Pipeline parallelism (PP) leaves bubbles in the schedule, which can be exploited to make better use of the hardware's compute.

    [figure: image-20250209203349067]

    However, naively having the remaining workers take over a failed worker's micro-batches has an obvious drawback: it clearly hurts performance, because the bubbles in the middle of the pipeline schedule grow significantly (a rough estimate of the original bubble fraction is given below).
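    For context, a standard back-of-the-envelope estimate (not a number from the paper): in a synchronous 1F1B/GPipe-style schedule with $p$ pipeline stages and $m$ micro-batches per iteration, the idle fraction of each worker's time is roughly

    ```latex
    \text{bubble fraction} \approx \frac{p - 1}{m + p - 1}
    ```

    so the warm-up and cool-down bubbles that ReCycle wants to reuse are already a noticeable share of every iteration.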

  3. Properties of the backward pass

    [figure: image-20250209203802543]

    B_input is needed by the rest of the pipeline (the upstream stages' backward passes depend on it), whereas B_weight is only this stage's own weight update and can be deferred until the end.

    [figure: image-20250209204055508]

    With this scheme, both the bubbles in the middle of the pipeline and the extra time steps added by rerouting are greatly reduced. However, it creates a new problem: additional memory pressure (a small sketch of the decoupled backward pass follows).
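    A minimal PyTorch-style sketch of the decoupled backward pass for a single linear stage (illustrative only, not the paper's implementation): B_input is computed and sent upstream immediately, while B_weight is deferred and can be executed later inside a bubble.

    ```python
    import torch

    # One hypothetical pipeline stage: y = x @ W.T
    W = torch.randn(8, 16, requires_grad=True)    # this stage's weights
    x = torch.randn(4, 16, requires_grad=True)    # activation from the previous stage
    y = x @ W.t()
    grad_y = torch.randn_like(y)                  # gradient arriving from the next stage

    # B_input: gradient w.r.t. the input, needed right away by the previous stage.
    grad_x, = torch.autograd.grad(y, x, grad_y, retain_graph=True)

    # B_weight: gradient w.r.t. the weights, deferred (e.g., run inside a schedule bubble).
    grad_W, = torch.autograd.grad(y, W, grad_y)
    ```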

  4. Selective decoupling of the backward pass

    Consider micro-batch 1 on W0_2 in the schedule above: if the backward pass is decoupled, its weights are not updated until step 25, and only then can the corresponding stashed intermediate activations be released. To avoid out-of-memory errors, the decoupling is applied selectively: the backward pass is only split when there is enough memory available (see the sketch below).
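    A hedged sketch of the selective-decoupling decision (all names are made up for illustration): decoupling is only chosen when the extra activations that must stay stashed until the deferred B_weight runs still fit in the remaining device memory.

    ```python
    def should_decouple(stash_bytes_per_microbatch, extra_inflight_microbatches,
                        free_device_bytes, safety_margin=0.9):
        """Decide whether to decouple B_input/B_weight for the next micro-batch.

        Deferring B_weight keeps that micro-batch's activations alive longer, so the
        price is roughly one extra stashed activation set per additional in-flight
        micro-batch; fall back to the coupled backward pass if this does not fit.
        """
        extra = stash_bytes_per_microbatch * extra_inflight_microbatches
        return extra <= safety_margin * free_device_bytes
    ```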

  5. Exploiting the warm-up bubbles with an asynchronous optimizer step

    The cool-down bubbles can be filled by decoupling the backward computation, but the warm-up bubbles still cannot be used effectively, because the optimizer step that updates the model parameters only runs after a complete forward and backward pass. This work introduces an asynchronous (staggered) optimizer step, which in effect shifts the warm-up bubbles into the cool-down phase (see the sketch after this item).

    [figure: image-20250209211822685]

    One criticism of this scheme is that in practice there are not this many bubbles; also, another paper with essentially the same idea was published on arXiv on the AI side.
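    A minimal sketch of the staggered optimizer step (hypothetical structure, not the paper's code): each pipeline stage steps its own optimizer as soon as its last deferred B_weight for the iteration finishes, instead of waiting at a barrier shared by all stages, so it can start the next iteration's forward passes right away.

    ```python
    class StageRunner:
        """Per-stage loop without a cross-stage barrier before the optimizer step."""

        def __init__(self, optimizer):
            self.optimizer = optimizer

        def finish_iteration(self, deferred_weight_grad_ops):
            # Drain this stage's deferred B_weight work (e.g., inside cool-down bubbles)...
            for op in deferred_weight_grad_ops:
                op()
            # ...then step immediately: no synchronization with other pipeline stages,
            # which lets this stage begin the next iteration's warm-up forwards early.
            self.optimizer.step()
            self.optimizer.zero_grad()
    ```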

  6. Dynamic adaptation to support more failures

    ReCycle dynamically adjusts the worker assignment to achieve better fault tolerance.

[figure: image-20250209213833975]

Design

[figure: image-20250209214106749]

  1. Failure Normalization

    For each possible number of failures, ReCycle first computes the optimal distribution of failures across the pipelines; if the actual failure pattern does not match it, failed and healthy workers are swapped until the optimal distribution is reached. Dynamic programming is used to generate the best allocation for every failure count within a given range (a sketch of such a dynamic program follows the figure).

    [figure: image-20250209222536903]
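    A hedged sketch of such a dynamic program (the throughput model here is a toy stand-in, not the paper's cost model): distribute the failed workers over the data-parallel pipelines so that the estimated total throughput is maximized; the actual failure pattern is then normalized to this target by swapping workers.

    ```python
    from functools import lru_cache

    def best_failure_distribution(num_pipelines, failures, throughput):
        """Return (best total throughput, per-pipeline failure counts).

        `throughput(k)` is a toy estimate of one pipeline's throughput when it
        absorbs k failed workers.
        """
        @lru_cache(maxsize=None)
        def solve(p, f):
            if p == 0:
                return (0.0, ()) if f == 0 else (float("-inf"), ())
            best = (float("-inf"), ())
            for k in range(f + 1):                  # failures assigned to pipeline p-1
                rest_total, rest_assign = solve(p - 1, f - k)
                total = rest_total + throughput(k)
                if total > best[0]:
                    best = (total, rest_assign + (k,))
            return best

        return solve(num_pipelines, failures)

    # Example: 4 data-parallel pipelines, 3 failures, toy per-pipeline cost model.
    print(best_failure_distribution(4, 3, lambda k: 1.0 / (1.0 + 0.5 * k)))
    ```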

  2. Adaptive Schedule Generation

    • Definitions

      • A five-dimensional variable identifies each operation a pipeline worker may take over: data-parallel index, stage id, operation type, and the pipeline-parallel index it is migrated from and to.
      • S ∈ {0,1} indicates whether the operation is actually scheduled.
      • T is the operation's time cost and M its memory footprint.
      • O ∈ {0,1} encodes the relative order of two operations (e.g., the O for B_input comes after the O for F).
      • E is the operation's finish time.
    • Objective

      • Minimize the finish time of the last operation.
    • Constraints (a hedged formalization is sketched after this list)

      • Computation order across stages:

        • Stage s-1's forward must complete before stage s's forward; stage s+1's backward must complete before stage s's backward.

          [figure: image-20250209230111991]

        • Within a backward pass, B_input must precede B_weight.

          [figure: image-20250209230252639]

      • Computations on the same worker must not overlap in time.

        [figure: image-20250209230653167]

      • Memory usage must not exceed the device limit.

        [figure: image-20250209230740456]
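    A hedged LaTeX sketch of the resulting optimization problem, reconstructed from the bullets above rather than copied from the paper (symbols: $E_o$ end time, $T_o$ duration, $M_o$ memory footprint of operation $o$; $O_{o,o'} \in \{0,1\}$ orders two operations mapped to the same worker; $\mathcal{M}$ is a big-M constant):

    ```latex
    \begin{aligned}
    \text{minimize}\quad & \max_{o} E_o
      && \text{(finish the last operation as early as possible)} \\
    \text{s.t.}\quad
      & E_{F_s} \ge E_{F_{s-1}} + T_{F_s},\qquad
        E_{B_s} \ge E_{B_{s+1}} + T_{B_s}
      && \text{(forward/backward order across stages)} \\
      & E_{B^{w}_s} \ge E_{B^{in}_s} + T_{B^{w}_s}
      && \text{(B\_input before B\_weight)} \\
      & E_{o'} - T_{o'} \ge E_{o} - \mathcal{M}\,(1 - O_{o,o'}),\qquad
        O_{o,o'} + O_{o',o} = 1
      && \text{(no overlap on one worker)} \\
      & \sum_{o\ \text{alive at}\ t} M_o \le M_{\max}\quad \forall t
      && \text{(stashed activations fit in device memory)}
    \end{aligned}
    ```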

Evaluation

Reflections

How would I approach this problem?

Can this insight be extended into other methods?

Can this insight be transferred to other domains?

What about this work could be improved?

Q&A

The same idea had already appeared on arXiv earlier, on November 30, 2023: [2401.10241] Zero Bubble Pipeline Parallelism
